Author: Tiberiu Iancu
Date: 01.07.2025
This analysis looks into the performance of hyperbolic deep learning models from a compute perspective, in order to identify inefficiencies and bottlenecks in the implementation.
The mathematical operations themselves, as well as potential numerical optimizations, are out of scope.
System: I ran the experiments on my own PC with Ubuntu 24.04, an RTX 2070 GPU with 8 GB VRAM on CUDA 12.8, and an i7-6800K CPU.
Workload: I chose to profile a simple 2-layer MLP with torch.compile, which generates Triton kernels that fuse consecutive element-wise operations. In simpler terms, compilation speeds up execution by keeping intermediate results in fast on-chip memory instead of round-tripping through global GPU memory between kernels.
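A minimal sketch of this kind of setup (the hidden width and batch size are illustrative, not the exact configuration used in the experiments):

```python
import torch
import torch.nn as nn

# Illustrative 2-layer MLP over 224x224 RGB inputs with 256 classes;
# the hidden width here is an assumption.
model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(3 * 224 * 224, 128),
    nn.ReLU(),
    nn.Linear(128, 256),
)

# torch.compile traces the model and generates fused kernels (Triton on
# GPU, C++ on CPU). Compilation can fail on some setups, so we fall back
# to eager execution rather than crash.
compiled = torch.compile(model)
x = torch.randn(2, 3, 224, 224)
try:
    out = compiled(x)
except Exception:
    out = model(x)  # eager fallback
print(out.shape)  # torch.Size([2, 256])
```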
Dataset: I used Caltech256 (224x224 input, 256 classes), as I wasn't able to get access to ImageNet. I initially tried CIFAR10, but the low resolution resulted in low GPU occupancy, so I chose the more realistic scenario of higher resolution images.
For reference, we present a trace captured from the training of the Euclidean MLP. The figure below displays the different stages of the training: transforming and loading data into GPU memory, forward pass (+criterion computation), backward pass, and optimizer step.
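Traces like this one can be captured with torch.profiler and opened in the Chrome trace viewer or Perfetto. A minimal sketch (the toy model and tensor sizes are placeholders):

```python
import torch
from torch.profiler import profile, ProfilerActivity

model = torch.nn.Linear(64, 10)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = torch.nn.CrossEntropyLoss()
x, y = torch.randn(32, 64), torch.randint(0, 10, (32,))

# Record one full training step: forward + criterion, backward, optimizer.
# Add ProfilerActivity.CUDA to capture the GPU stream as well.
with profile(activities=[ProfilerActivity.CPU]) as prof:
    opt.zero_grad()
    loss_fn(model(x), y).backward()
    opt.step()

prof.export_chrome_trace("trace.json")  # viewable in chrome://tracing / Perfetto
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=5))
```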
Two things stand out: first, the GPU is continuously busy, and does not wait for the CPU to queue operations. Second, the bulk of the execution time is spent on matrix multiplications.
For simplicity, from now on we'll only inspect GPU traces. Below you can see a GPU trace from the forward pass.
Compared to the Euclidean MLP, the hyperbolic model has additional computation to do. First, the tensors must be moved to the manifold. This operation can likely be optimized. Second, the forward pass now consists of two compute-intensive operations: the Euclidean norm of the weights, as well as the usual matrix multiplication between input and weights. This layer could benefit from some optimization, and as we'll see later, torch.compile can help here.
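For concreteness, "moving to the manifold" usually means applying the exponential map at the origin. On the Poincaré ball this is one norm reduction plus a chain of element-wise operations, which is exactly the shape of computation torch.compile can fuse. A sketch (the function name and epsilon guard are my own):

```python
import torch

def expmap0(v: torch.Tensor, c: float = 1.0, eps: float = 1e-6) -> torch.Tensor:
    """Exponential map at the origin of the Poincare ball with curvature -c:
    expmap0(v) = tanh(sqrt(c) * ||v||) * v / (sqrt(c) * ||v||).
    One norm reduction plus element-wise ops -- a natural fusion target."""
    sqrt_c = c ** 0.5
    norm = v.norm(dim=-1, keepdim=True).clamp_min(eps)
    return torch.tanh(sqrt_c * norm) * v / (sqrt_c * norm)

x = torch.randn(4, 8)
y = expmap0(x)
# For c = 1, all mapped points lie strictly inside the unit ball.
print(y.norm(dim=-1).max().item() < 1.0)  # True
```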
The backward pass looks even more poorly optimized. In this case, the two matrix multiplications take up only a small portion of the total computation. Here we would expect a lot of performance gain from fusing operations together.
When compiling the model, performance improves significantly. The forward pass now takes 1.3 ms, compared to 4.1 ms in the non-compiled version. Torch seems to have fused most element-wise operations together, but more importantly, it optimized the kernel launch parameters, leading to a significant performance improvement in both the Euclidean norm and the matrix multiplication.
The compiled backward pass also improved by a factor of 4, from 28 ms to only 7 ms, mostly attributable to the fusing of the many element-wise operations:
The compilation also drastically improves the performance of the optimizer, cutting one step from 50 ms to 16.7 ms.
The execution footprint of one iteration is summarized in the table below:
| | Move to manifold (ms) | Forward (ms) | Backward (ms) | Optimizer (ms) | Total (ms) | Peak memory (GB) |
|---|---|---|---|---|---|---|
| Euclidean | 1.1 | | | | | |
| Hyperbolic | 1.4 | 4.1 | 28 | 50 | | |
| Hyperbolic + compile | 1.4 | 1.3 | 7 | 16.7 | | |
For reference, we first inspect the GPU trace of a ResNet18 with batch size 8. Even on such a small architecture the GPU is mostly occupied executing convolution kernels. Here, the forward pass duration is 20ms, and the backward pass takes approximately 19ms.
Sadly, the hyperbolic network's computational graph is far too large, and analyzing such large traces is not feasible. Furthermore, I was unable to torch.compile larger hyperbolic ResNets. I chose to profile a much smaller ResNet architecture with only ~1500 parameters and batch size 2.
Even such a small model takes an unreasonably long time for the forward pass: a staggering 2.25 seconds. Of this, only 150 ms is spent in the ResNet blocks, while over 2 seconds is spent computing the Fréchet mean.
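The Fréchet mean is the point minimizing the sum of squared geodesic distances to the inputs; in hyperbolic space it has no closed form and must be computed iteratively, which is why it dominates the forward pass. A generic sketch of the idea (the distance function, step count, and learning rate are illustrative; a real hyperbolic version would use the geodesic distance and a Riemannian optimizer):

```python
import torch

def frechet_mean(points, sqdist_fn, steps=100, lr=0.1):
    """Minimize sum_i sqdist(m, x_i) over m by gradient descent."""
    m = points[0].clone().requires_grad_(True)
    opt = torch.optim.SGD([m], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        sqdist_fn(m, points).sum().backward()
        opt.step()
    return m.detach()

# Sanity check: with the squared Euclidean distance, the Frechet mean
# must converge to the ordinary arithmetic mean.
pts = torch.tensor([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
mean = frechet_mean(pts, lambda m, x: (m - x).pow(2).sum(dim=-1))
print(mean)  # ~ [0.3333, 0.3333]
```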
When inspecting the ResNet blocks themselves, we identify a second bottleneck: hyperbolic BatchNorm. Even when avoiding the Fréchet mean by using the midpoint optimization, batch norm is still the most time-consuming layer. Kernel fusing can likely improve performance greatly.
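The midpoint optimization replaces the iterative Fréchet mean with a closed-form aggregation. One common choice is the Einstein midpoint in the Klein model: a gyro-weighted average that costs one norm, a few element-wise operations, and a sum (the boundary clamp is my own guard):

```python
import torch

def einstein_midpoint(x: torch.Tensor) -> torch.Tensor:
    """Einstein midpoint of points in the Klein model (curvature -1):
    m = sum_i gamma_i x_i / sum_i gamma_i, gamma_i = 1/sqrt(1 - ||x_i||^2).
    Closed form, so no iterative solver is needed."""
    sq_norm = x.pow(2).sum(dim=-1, keepdim=True).clamp(max=1 - 1e-6)
    gamma = 1.0 / torch.sqrt(1.0 - sq_norm)
    return (gamma * x).sum(dim=0) / gamma.sum(dim=0)

# The midpoint of two symmetric points is the origin.
pts = torch.tensor([[0.5, 0.0], [-0.5, 0.0]])
print(einstein_midpoint(pts))  # tensor([0., 0.])
```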
Zooming in on the GPU stream in the conv layer, we see that very little time is spent actually executing kernels. Part of the reason is the small model size, so this result should be taken with a grain of salt. However, the conv2d implementation suggests the operation is inefficient: the high-level Torch implementation (unfold + dense matmul + fold) makes poor use of the GPU. Rewriting the operation as a dedicated GPU kernel (e.g., using the Winograd algorithm) should greatly improve throughput.
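The unfold-based pattern is easy to reproduce. The sketch below (no padding, stride 1, shapes illustrative) shows why it is wasteful: im2col materializes a large intermediate and each step launches separate kernels, where cuDNN would use a single fused implementation:

```python
import torch
import torch.nn.functional as F

def conv2d_unfold(x, weight):
    """Conv2d via unfold + dense matmul + reshape (no padding, stride 1).

    Mirrors the high-level pattern in the hyperbolic conv layer: im2col
    materializes a (n, c*kh*kw, L) intermediate, the matmul is dense, and
    the result is reshaped back -- several kernels instead of one.
    """
    n, c, h, w = x.shape
    o, _, kh, kw = weight.shape
    cols = F.unfold(x, (kh, kw))     # (n, c*kh*kw, L) im2col buffer
    out = weight.view(o, -1) @ cols  # batched dense matmul -> (n, o, L)
    return out.view(n, o, h - kh + 1, w - kw + 1)

x = torch.randn(2, 3, 8, 8)
w = torch.randn(4, 3, 3, 3)
ref = F.conv2d(x, w)
ours = conv2d_unfold(x, w)
print(torch.allclose(ours, ref, atol=1e-5))  # True
```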
I'll leave the backward pass out, as it doesn't provide any additional insight.
Compiling the model provides the biggest improvement we've seen so far: the Fréchet mean operation was reduced from over 2 seconds to 4 ms! This appears to be due to the fusion of many element-wise operations.
The runtime of the ResNet blocks was reduced from 150 to 60ms; we can no longer inspect performance of specific layers in the trace, as these have been fused together. Again, GPU utilization is very low due to the small model size. We'll leave this out of the analysis as it doesn't provide any insight.
The compiled ResNet peaked at 0.04 GB of memory used, while the uncompiled ResNet peaked at 0.1 GB.
In this analysis we've seen that hyperbolic networks suffer from severe computational bottlenecks, mainly due to the high-level Torch implementation. Compiling the models goes a long way in optimizing operations. However, compiling does not seem to be the definitive solution to our problems: it acts rather like a band-aid on the inefficiency wound. There are three main drawbacks that come with compiling. First, the large overhead (even compiling the small ResNet takes a few minutes) slows down development. Second, compilation of larger models seems to be unstable and does not always succeed. Third, compilation results cannot be saved, meaning re-compilation must be performed on every run.
I believe that moving forward, many of the hyperbolic operations need their own GPU kernels in order to make ResNet viable at scale. The main candidates are the feed-forward layer, convolution, batch norm, and average pooling (Fréchet mean). Of these, we've seen that compilation is fast and painless for the feed-forward layer and produces very good results. Similarly, compiling the Fréchet mean operation is highly effective, and writing it by hand would likely be extremely time-consuming (judging by the number of fused kernels in the GPU trace). This leaves the convolutional block and batch norm. I believe these can benefit most from manual optimization, as internally they perform aggregations or more complex data manipulation.